add: ModelOpt Launcher for Slurm job submission#1031
Conversation
Add a launcher/ module with launch.py that submits quantization, training, and evaluation jobs to Slurm clusters via nemo-run. Produces an identical code layout to nmm-sandbox's slurm.py so the same YAML configs work in both. Includes Megatron-LM and Model-Optimizer as submodules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Introduces a comprehensive ModelOpt launcher framework for submitting quantization, training, and evaluation jobs to Slurm clusters or Docker environments. Includes core orchestration logic, configuration models, factory-based task creation, example YAML pipelines for Qwen3-8B, and shell/Python utilities for data synthesis, model quantization, and speculative decoding workflows.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Is there any way to use ModelOpt from the repo root instead of adding additional submodule?
Yes, I iterated through several versions. The alternatives are a relative import or a soft reference; the benefit would be not having to submodule ModelOpt.

Cons:
- Relative import: the nemo-run packager cannot handle this well, so in the code/ section (nemo-run copies the entire code tree for every experiment) the modelopt source cannot quite maintain its original structure. In the internal sandbox, ModelOpt lives in modules/Model-Optimizer, so if the layout differs here we would need to fix a lot of path differences.
- Soft reference: this should be doable, but users would need to create the soft reference themselves, and I haven't tested it fully.
Update: I changed it to a soft reference, so git clone should keep this relative symlink. Even if it is accidentally deleted, launch.py will recreate it.
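For reference, a minimal sketch of the recreate-if-missing behavior described here (the function name and the `launcher/modules/Model-Optimizer` link target are assumptions based on this thread, not the actual launch.py code):

```python
import os

def ensure_modelopt_symlink(launcher_dir: str) -> str:
    """Recreate the relative Model-Optimizer symlink if it was deleted (hypothetical helper)."""
    link = os.path.join(launcher_dir, "modules", "Model-Optimizer")
    if not os.path.lexists(link):
        os.makedirs(os.path.dirname(link), exist_ok=True)
        # A relative target ("../..") survives `git clone` into any location.
        os.symlink(os.path.join("..", ".."), link)
    return link
```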
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1031      +/-   ##
==========================================
- Coverage   71.73%   70.09%    -1.65%
==========================================
  Files         211      221       +10
  Lines       23949    25541     +1592
==========================================
+ Hits        17181    17902      +721
- Misses       6768     7639      +871
==========================================
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitmodules:
- Around line 4-6: Remove the self-referential submodule entry for
"launcher/modules/Model-Optimizer" in .gitmodules (the block with path =
launcher/modules/Model-Optimizer and url =
https://github.com/NVIDIA/Model-Optimizer.git); instead rely on the top-level
Model-Optimizer working tree for packaging/mounting and remove any code that
updates the nested gitlink for launcher/modules/Model-Optimizer so you no longer
create or expect a nested checkout.
- Around line 1-3: The .gitmodules entry for the submodule at path
"launcher/modules/Megatron-LM" currently points to the personal fork URL
"https://github.com/AAnoosheh/Megatron-LM.git"; update that submodule URL to the
canonical upstream "https://github.com/NVIDIA/Megatron-LM.git" (and then run git
submodule sync && git submodule update --init --recursive) so the
launcher/modules/Megatron-LM submodule references NVIDIA/Megatron-LM instead of
the personal fork.
In `@launcher/launch.py`:
- Around line 87-89: The launcher sets container_mounts to mount the host cache
at /hf-local (variable container_mounts, local hf_local), but the code still
defaults HF_HOME to /hf-cache so the mounted cache is ignored; update places
that set the HF_HOME default (where container_mounts is None and the other
occurrences at the blocks referenced around lines 111-122 and 330-336) to align
HF_HOME with hf_local (i.e., use the hf_local value or "/hf-local" as the
HF_HOME default) so the injected environment points HuggingFace to the actual
mounted path.
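The suggested alignment boils down to defaulting HF_HOME from the same value used for the mount. A sketch under the assumption that the launcher builds the env as a plain dict (`build_task_env` and its signature are hypothetical, not the actual launch.py API):

```python
def build_task_env(task_env: dict, hf_local: str = "/hf-local") -> dict:
    """Return a copy of the task env with HF_HOME defaulted to the mounted cache path."""
    env = dict(task_env)
    # Default HF_HOME to the mount target instead of the stale /hf-cache.
    env.setdefault("HF_HOME", hf_local)
    return env
```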
- Around line 377-379: The code sets NEMORUN_HOME via
run.config.set_nemorun_home but still writes/reads metadata.json relative to the
process CWD; update all metadata file operations to use the configured NeMo Run
home instead of os.getcwd(): replace os.getcwd()/bare relative paths with
run.config.get_nemorun_home() (or run.config.nemorun_home) when constructing
paths for metadata.json and any related experiment directories (e.g., where
metadata is written around run.config.set_nemorun_home and the similar block at
the other location), ensuring both reads and writes use the same configured
home.
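The fix can be sketched as a helper that roots every metadata path at the configured home (the function and argument names here are illustrative, not the actual core.py API):

```python
import json
import os

def write_metadata(nemorun_home: str, experiment_title: str, exp_id: str, metadata: dict) -> str:
    """Write metadata.json under the configured NeMo Run home instead of the CWD."""
    path = os.path.join(nemorun_home, "experiments", experiment_title, exp_id, "metadata.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(metadata, f)
    return path
```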
- Around line 407-425: The issue is that DEFAULT_LOCAL_ENV and DEFAULT_SLURM_ENV
are applied after task_env, causing task-level environment keys to be
overwritten; in the block that builds task_env (around task.environment,
hf_local, get_docker_executor and get_slurm_executor) apply the defaults before
merging task values or merge them using defaults as base (e.g., start with a
copy of DEFAULT_LOCAL_ENV/DEFAULT_SLURM_ENV then update with task.environment)
or use setdefault semantics so that explicit task.environment entries (HF_HOME,
HF_TOKEN, MLM_SKIP_INSTALL, LAUNCH_SCRIPT) override the launcher defaults;
adjust the update order where task_env.update(DEFAULT_LOCAL_ENV) /
task_env.update(DEFAULT_SLURM_ENV) currently occur so task values take
precedence.
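The precedence fix is a one-line change in merge order; a minimal sketch (the default values shown are illustrative):

```python
DEFAULT_SLURM_ENV = {"HF_HOME": "/hf-cache", "MLM_SKIP_INSTALL": "0"}  # illustrative defaults

def merge_env(task_environment: dict) -> dict:
    """Start from the launcher defaults, then let explicit task entries win."""
    env = dict(DEFAULT_SLURM_ENV)
    env.update(task_environment)  # task.environment overrides the defaults
    return env
```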
- Around line 171-173: The code currently looks up a callable via globals()
using function_name and invokes it—replace this with an explicit allowlist
factory map (e.g., a dict mapping allowed factory names to their functions) and
use that map to resolve the factory instead of globals(); in the block that sets
function_name and slurm_config, validate that function_name is present in the
factory map, raise a clear error if not allowed, then call the resolved factory
with slurm_config (preserving the existing use of config_from_yaml["script"],
function_name, and slurm_config variables) to produce the slurm config.
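A sketch of the allowlist pattern the prompt describes (the factory body and config shape are hypothetical stand-ins for the real slurm_config.py factory):

```python
def slurm_factory(slurm_config: dict) -> dict:
    """Hypothetical factory standing in for the real one in slurm_config.py."""
    return {"kind": "slurm", **slurm_config}

FACTORY_MAP = {"slurm_factory": slurm_factory}  # explicit allowlist, not globals()

def resolve_factory(function_name: str):
    """Resolve a factory by name, rejecting anything outside the allowlist."""
    try:
        return FACTORY_MAP[function_name]
    except KeyError:
        raise ValueError(
            f"Unknown factory {function_name!r}; allowed: {sorted(FACTORY_MAP)}"
        ) from None
```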
- Around line 250-270: packager is using LAUNCHER_DIR for all include_pattern
entries, causing "services/*" and "tests/*" (which live at the repository root /
MODELOPT_ROOT) to be skipped; update the packager call so the relative_path list
maps the first five patterns to LAUNCHER_DIR and the last two patterns
("services/*", "tests/*") to MODELOPT_ROOT (i.e., make relative_path =
[LAUNCHER_DIR]*5 + [MODELOPT_ROOT]*2) so the include_pattern entries resolve to
the correct directories; keep the include_pattern list unchanged and only adjust
the relative_path mapping in the packager invocation.
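The corrected mapping is just two parallel lists; a sketch with placeholder directories and pattern names (only `services/*` and `tests/*` come from the review, the first five patterns are illustrative):

```python
LAUNCHER_DIR = "/repo/launcher"  # placeholder path
MODELOPT_ROOT = "/repo"          # placeholder path

# First five patterns are illustrative; the review names only the last two.
include_pattern = ["*.py", "*.yaml", "*.sh", "*.toml", "*.md", "services/*", "tests/*"]
relative_path = [LAUNCHER_DIR] * 5 + [MODELOPT_ROOT] * 2
```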
- Around line 417-424: Replace direct uses of the internal Experiment._id:
implement a small helper get_experiment_id(exp) that obtains the experiment id
via supported public APIs (try exp.id, then exp.get_id(), then
exp.metadata.get("id") or exp.to_dict().get("id")) and raise a clear error
instructing to update nemo_run if none exist; then call get_experiment_id(exp)
instead of exp._id in get_docker_executor(...) and get_slurm_executor(...) (and
the other occurrences referenced) so the launcher no longer depends on the
private _id attribute.
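A sketch of the suggested helper, trying public accessors before failing loudly (which fallbacks actually exist depends on the installed nemo_run version):

```python
def get_experiment_id(exp) -> str:
    """Fetch an experiment id via public APIs only; never touch the private _id."""
    value = getattr(exp, "id", None)
    if isinstance(value, str) and value:
        return value
    getter = getattr(exp, "get_id", None)
    if callable(getter):
        return getter()
    raise AttributeError(
        "No public experiment-id API found; update nemo_run or extend this helper."
    )
```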
In `@launcher/modules/Megatron-LM`:
- Line 1: Document why the launcher submodule uses the personal fork
https://github.com/AAnoosheh/Megatron-LM.git (branch
aanuosh…/modelopt-example-import-order) by adding a short rationale in the repo
(e.g., README or launcher packager docs): list the exact diffs/changes required
from the fork that the launcher depends on (identify files/functions/patches),
explain why those changes cannot live in NVIDIA/Megatron-LM as-is, state whether
and when the changes will be upstreamed or removed (upstream PR IDs or
timeline), and describe the maintenance plan for this fork (how it will be
rebased/merged with upstream security fixes and who owns it). Ensure references
to the submodule URL, branch name, and the launcher packager are included so
reviewers can locate the dependency and the specific modifications.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: af7e3654-7a2f-40c2-9b19-f85cc6e8f625
📒 Files selected for processing (6)
- .gitmodules
- launcher/__init__.py
- launcher/launch.py
- launcher/modules/Megatron-LM
- launcher/modules/Model-Optimizer
- launcher/pyproject.toml
Extract shared logic (dataclasses, executor builders, run loop, version reporting) into core.py. Both launch.py and nmm-sandbox's slurm.py import from core.py to avoid divergence. Add slurm_config.py with a generic env-var-driven factory, service scripts, a Qwen3-8B PTQ example, and a README with usage, flags, and bug reporting instructions. Verified: the same YAML produces identical results via both slurm.py and launch.py (MMLU 0.736 on OCI-HSG, 0.719 locally).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Actionable comments posted: 3
🧹 Nitpick comments (7)
launcher/services/megatron-lm/quantize/quantize.sh (2)

46-47: `exit_handler` is called with an argument it doesn't accept.

`exit_handler` in service_utils.sh takes no parameters, but it's called with `$0`. This is harmless but inconsistent with its definition.

🔧 Suggested fix

```diff
 # This function handles the exit status (fails the CI).
-exit_handler $0
+exit_handler
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/services/megatron-lm/quantize/quantize.sh` around lines 46 - 47, the call to exit_handler passes an unused argument ($0) even though exit_handler (defined in service_utils.sh) accepts no parameters; update the call to invoke exit_handler with no arguments so it matches the function signature while keeping behavior unchanged.

33-36: Remove or document unused `CONVERT_EXE` and `EXPORT_EXE`.

These variables are defined but never used. If they're placeholders for future workflow steps, add a comment; otherwise remove them to reduce confusion.

🔧 Suggested fix

```diff
 QUANTIZE_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/quantize.sh"
 MMLU_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/mmlu.sh"
-CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
-EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"
+# TODO: Add convert and export steps
+# CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
+# EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/services/megatron-lm/quantize/quantize.sh` around lines 33 - 36, CONVERT_EXE and EXPORT_EXE are defined but never used; either delete their definitions or add a brief comment above them stating they are intentional placeholders for future convert/export steps. Leave QUANTIZE_EXE and MMLU_EXE unchanged.

launcher/services/service_utils.sh (2)
24-25: Remove or use the `FAIL` variable.

`FAIL` is set in `error_handler` (line 34) but never read; only `FAIL_EXIT` is used. Either remove `FAIL` or export it if external scripts rely on it.

🔧 Suggested fix

```diff
-FAIL=0
 FAIL_EXIT=0
```

Or if needed externally:

```diff
-FAIL=0
+export FAIL=0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/services/service_utils.sh` around lines 24 - 25, the variable FAIL is assigned but never read; if other scripts rely on it, export FAIL and ensure error_handler sets it; otherwise delete FAIL and keep only FAIL_EXIT.

50-54: Pin the `diskcache` version for reproducibility.

Installing `diskcache` without a version specifier can lead to non-deterministic builds and potential compatibility issues.

🔧 Suggested fix

```diff
 function util_install_extra_dep {
     if [[ "$mpi_local_rank" -eq 0 ]]; then
-        pip install diskcache
+        pip install 'diskcache>=5.6,<6.0'
     fi
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/services/service_utils.sh` around lines 50 - 54, the pip install call in util_install_extra_dep installs diskcache without a version; pin diskcache to a specific, tested version or read it from a single constant (e.g., DISKCACHE_VERSION) so builds are deterministic; keep the mpi_local_rank guard unchanged.

launcher/pyproject.toml (1)
6-9: Consider pinning the `pyyaml` version for reproducibility.

`nemo-run` is pinned to a specific commit hash, but `pyyaml` is unpinned. For consistent, reproducible builds, consider pinning to a specific version (e.g., `pyyaml>=6.0,<7.0` or an exact version).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/pyproject.toml` around lines 6 - 9, the dependency list pins "nemo-run" to a commit but leaves "pyyaml" unpinned, risking non-reproducible builds; pin "pyyaml" to a specific range or exact version (for example "pyyaml>=6.0,<7.0" or "pyyaml==6.0.1") so the dependencies resolve deterministically.

launcher/slurm_config.py (2)
58-61: Mutable default arguments in function signature.

`container_mounts` and `srun_args` use list literals as default values. In Python, mutable defaults are evaluated once at function definition time and shared across all calls, which can lead to unexpected behavior if the lists are modified.

🔧 Suggested fix

```diff
 def slurm_factory(
     host: str = os.environ.get("SLURM_HOST", ""),
     account: str = os.environ.get("SLURM_ACCOUNT", ""),
     partition: str = "batch",
     nodes: int = 1,
     ntasks_per_node: int = 1,
     gpus_per_node: int = 1,
     container: str = "nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5",
     modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt",
-    container_mounts: list[str] = [
-        "{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")),
-    ],
-    srun_args: list[str] = ["--no-container-mount-home"],
+    container_mounts: list[str] | None = None,
+    srun_args: list[str] | None = None,
     array: str = None,  # noqa: RUF013
 ) -> SlurmConfig:
     """Generic Slurm factory — configure via environment variables or CLI overrides."""
+    if container_mounts is None:
+        container_mounts = [
+            "{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")),
+        ]
+    if srun_args is None:
+        srun_args = ["--no-container-mount-home"]
     return SlurmConfig(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/slurm_config.py` around lines 58 - 61, the defaults for container_mounts and srun_args are mutable list literals; change their defaults to None in the signature and initialize fresh lists inside the function when None so each call gets its own list and avoids shared mutable state; update any references to these parameters accordingly.
32-44: Type hints don't match default values.

Several fields have type hints (e.g., `str`, `list[str]`) but default to `None`. Consider using union types for clarity:

🔧 Suggested fix

```diff
 @dataclass
 class SlurmConfig:
-    host: str = None
+    host: str | None = None
     port: int = 22
-    account: str = None
+    account: str | None = None
     partition: str = "batch"
-    container: str = None
+    container: str | None = None
     modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt"
-    container_mounts: list[str] = None
-    srun_args: list[str] = None
-    array: str = None
+    container_mounts: list[str] | None = None
+    srun_args: list[str] | None = None
+    array: str | None = None
     nodes: int = 1
     ntasks_per_node: int = 1
     gpus_per_node: int = 1
     local: bool = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/slurm_config.py` around lines 32 - 44, the annotated types for host, account, container, container_mounts, srun_args, and array conflict with their None defaults; update the hints to allow None (e.g., `str | None`, `list[str] | None`) while keeping the existing defaults and leaving the other fields unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@launcher/core.py`:
- Around line 313-329: Remove the inline "# nosec" comments: delete the "# nosec
B404" on the "import subprocess" line and the "# nosec B603 B607" comments on
the subprocess.run calls that set the commit and branch variables (the lines
invoking git rev-parse for commit and branch). Keep the subprocess calls as-is
(list args, no shell=True, capture_output/text/timeout) and instead add to your
PR description or a nearby code comment a request for formal review by
`@NVIDIA/modelopt-setup-codeowners` with documented justification for why these
subprocess usages are safe.
In `@launcher/services/megatron-lm/quantize/quantize.sh`:
- Line 38: The assignment to MLM_EXTRA_ARGS incorrectly expands "${@}" into a
scalar; change it to either capture all args as a single string using "$*" (if
you want one string) or make MLM_EXTRA_ARGS an array by declaring
MLM_EXTRA_ARGS=("$@") (if you need separate elements), and update subsequent
uses of MLM_EXTRA_ARGS accordingly so they treat it as a string or an array as
intended.
In `@launcher/services/service_utils.sh`:
- Around line 59-62: Replace the runtime append of __version__ from the script:
stop modifying modules/Model-Optimizer/modelopt/__init__.py at runtime and
instead write/read a dedicated version file (e.g.,
modules/Model-Optimizer/VERSION) or inject the version at build-time; remove the
block that checks mpi_local_rank and echoes "__version__ = '1.0.0'"; implement a
safe creation/update step that checks for an existing VERSION file (or existing
__version__ entry) before writing and ensure the write happens in a
build/pre-deploy step or by a single global coordinator (not per-node
mpi_local_rank==0) to avoid race conditions and git dirty state.
---
Nitpick comments:
In `@launcher/pyproject.toml`:
- Around line 6-9: The dependency list in pyproject.toml pins "nemo-run" to a
commit but leaves "pyyaml" unpinned, risking non-reproducible builds; update the
dependencies array to pin "pyyaml" to a specific range or exact version (for
example change the "pyyaml" entry in the dependencies list to something like
"pyyaml>=6.0,<7.0" or an exact "pyyaml==6.0.1") so that the dependencies block
in pyproject.toml deterministically resolves.
In `@launcher/services/megatron-lm/quantize/quantize.sh`:
- Around line 46-47: The call to exit_handler in quantize.sh passes an unused
argument ($0) even though exit_handler (defined in service_utils.sh) accepts no
parameters; update the call in quantize.sh to invoke exit_handler with no
arguments (replace "exit_handler $0" with "exit_handler") so it matches the
function signature and removes the inconsistency while keeping behavior
unchanged.
- Around line 33-36: CONVERT_EXE and EXPORT_EXE are defined but never used;
either remove their definitions or document them as placeholders. Edit the file
to either delete the lines defining CONVERT_EXE and EXPORT_EXE, or add a brief
comment above them stating they are intentional placeholders for future steps
(e.g., "placeholder for future convert/export scripts") and keep the variables
if needed; leave QUANTIZE_EXE and MMLU_EXE unchanged. Ensure the chosen change
keeps the script consistent (no unused vars unless explicitly documented).
In `@launcher/services/service_utils.sh`:
- Around line 24-25: The variable FAIL is assigned but never read; update the
code path around the error_handler and variable declarations to either remove
the unused FAIL variable or export/consume it where intended: if other scripts
rely on it, export FAIL (e.g., export FAIL) and ensure error_handler sets it;
otherwise delete FAIL and only keep FAIL_EXIT. Locate the declarations FAIL and
FAIL_EXIT and the error_handler function to make the change.
- Around line 50-54: The pip install call in function util_install_extra_dep
currently installs diskcache without a version; update util_install_extra_dep to
pin diskcache to a specific, tested version (e.g., replace the pip install
diskcache invocation with a pinned install like pip install
diskcache==<VERSION>) or read the version from a single constant/variable (e.g.,
DISKCACHE_VERSION) so builds are deterministic; keep the mpi_local_rank guard
unchanged.
In `@launcher/slurm_config.py`:
- Around line 58-61: The defaults for container_mounts and srun_args are
currently mutable list literals; change their default values to None in the
function or constructor signature (e.g., container_mounts: Optional[list[str]] =
None, srun_args: Optional[list[str]] = None) and inside the function initialize
them to fresh lists if None (create the container_mounts list using
"{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")) and set
srun_args to ["--no-container-mount-home"]) so each call gets its own list and
avoids shared mutable state; update any references to these parameters
accordingly (look for container_mounts and srun_args in the function/class where
they are defined).
- Around line 32-44: The annotated types for several dataclass/config fields
(host, account, container, container_mounts, srun_args, array) conflict with
their None defaults; update their type hints to allow None (e.g., use
Optional[str] for host/account/container/array and Optional[list[str]] for
container_mounts/srun_args) and add the necessary import from typing (Optional)
or use union syntax (str | None, list[str] | None) so the annotation matches the
default values; keep the existing defaults (None) and leave other fields (port,
nodes, ntasks_per_node, gpus_per_node, modelopt_install_path, local, partition)
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 56d56135-8c67-4a05-b578-e10120a5789f
📒 Files selected for processing (8)
- launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
- launcher/README.md
- launcher/core.py
- launcher/launch.py
- launcher/pyproject.toml
- launcher/services/megatron-lm/quantize/quantize.sh
- launcher/services/service_utils.sh
- launcher/slurm_config.py
✅ Files skipped from review due to trivial changes (1)
- launcher/README.md
```python
    import subprocess  # nosec B404

    try:
        commit = subprocess.run(  # nosec B603 B607
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=path,
            capture_output=True,
            text=True,
            timeout=5,
        ).stdout.strip()
        branch = subprocess.run(  # nosec B603 B607
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=path,
            capture_output=True,
            text=True,
            timeout=5,
        ).stdout.strip()
```
Remove # nosec comments per coding guidelines.
The coding guidelines state: "# nosec comments are not allowed as bypasses for security checks; contributors must request formal review from @NVIDIA/modelopt-setup-codeowners with documented justification instead."
The subprocess calls here are safe (list arguments, no shell=True, hardcoded commands), but the bypass comments should be removed and the pattern formally reviewed.
🔧 Suggested fix
def _git_info(path):
"""Get git commit hash and branch for a directory."""
- import subprocess # nosec B404
+ import subprocess
try:
- commit = subprocess.run( # nosec B603 B607
+ commit = subprocess.run(
["git", "rev-parse", "--short", "HEAD"],
cwd=path,
capture_output=True,
text=True,
timeout=5,
).stdout.strip()
- branch = subprocess.run( # nosec B603 B607
+ branch = subprocess.run(
["git", "rev-parse", "--abbrev-ref", "HEAD"],
cwd=path,
capture_output=True,
text=True,
timeout=5,
).stdout.strip()As per coding guidelines: "Disallow use of '# nosec' comments to bypass Bandit security checks."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    import subprocess

    try:
        commit = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=path,
            capture_output=True,
            text=True,
            timeout=5,
        ).stdout.strip()
        branch = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=path,
            capture_output=True,
            text=True,
            timeout=5,
        ).stdout.strip()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@launcher/core.py` around lines 313 - 329, Remove the inline "# nosec"
comments: delete the "# nosec B404" on the "import subprocess" line and the "#
nosec B603 B607" comments on the subprocess.run calls that set the commit and
branch variables (the lines invoking git rev-parse for commit and branch). Keep
the subprocess calls as-is (list args, no shell=True,
capture_output/text/timeout) and instead add to your PR description or a nearby
code comment a request for formal review by `@NVIDIA/modelopt-setup-codeowners`
with documented justification for why these subprocess usages are safe.
```bash
# Increase the modelopt version number manually
if [[ "$mpi_local_rank" -eq 0 ]]; then
    echo "__version__ = '1.0.0'" >> ./modules/Model-Optimizer/modelopt/__init__.py
fi
```
Runtime modification of `__init__.py` is fragile and may cause race conditions.

Appending `__version__` to the modelopt `__init__.py` at runtime:

- Modifies source files within a submodule, which can cause git dirty state
- May race with other MPI ranks despite the `mpi_local_rank == 0` guard (multiple nodes each have a local rank 0)
- Doesn't check if `__version__` already exists, causing duplicate entries on repeated runs

Consider moving version management to build-time or using a separate version file that's explicitly managed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@launcher/services/service_utils.sh` around lines 59 - 62, Replace the runtime
append of __version__ from the script: stop modifying
modules/Model-Optimizer/modelopt/__init__.py at runtime and instead write/read a
dedicated version file (e.g., modules/Model-Optimizer/VERSION) or inject the
version at build-time; remove the block that checks mpi_local_rank and echoes
"__version__ = '1.0.0'"; implement a safe creation/update step that checks for
an existing VERSION file (or existing __version__ entry) before writing and
ensure the write happens in a build/pre-deploy step or by a single global
coordinator (not per-node mpi_local_rank==0) to avoid race conditions and git
dirty state.
Thoughts on making top-level folder name generic like tools inside which we have launcher?
Didn't quite get it. Could you be more clear?
```yaml
# @@ -0,0 +1,13 @@
script: services/megatron-lm/quantize/quantize.sh
args:
  - --calib-dataset-path-or-name /hf-local/abisee/cnn_dailymail
```
Can we mention in README what hf-local is? Or for simplicity we can just make it HF path abisee/cnn_dailymail
```bash
# Increase the modelopt version number manually
if [[ "$mpi_local_rank" -eq 0 ]]; then
    echo "__version__ = '1.0.0'" >> ./modules/Model-Optimizer/modelopt/__init__.py
```
why do we need to modify ModelOpt version number?
This is a WAR (workaround), since we are mounting instead of pip installing. The version stays the same, which fails a lot of version checks in modelopt.torch. Any suggestion?
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
♻️ Duplicate comments (4)
launcher/core.py (4)
432-455: ⚠️ Potential issue | 🟠 Major: Preserve task-level env overrides when applying defaults.

On line 443 and line 455, defaults overwrite task-specific values. Reverse the merge precedence so the YAML task env wins.

Suggested change

```diff
-    task_env.update(default_local_env)
+    task_env = {**default_local_env, **task_env}
@@
-    task_env.update(default_slurm_env)
+    task_env = {**default_slurm_env, **task_env}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 432 - 455, the current merges call task_env.update(default_local_env) / task_env.update(default_slurm_env), which lets defaults overwrite task-specific YAML values; instead apply the defaults first and then layer the task-level env on top so the task YAML wins (e.g., copy the appropriate default dict and update it with task_env) before passing the result to build_docker_executor / build_slurm_executor.
482-485: ⚠️ Potential issue | 🟠 Major — Write metadata under the configured NeMo Run home, not CWD-relative.

`metadata.json` is currently written to a relative path on Line 482, which can diverge from the configured `NEMORUN_HOME` set by the launcher.

Suggested change:

```diff
-def run_jobs(
+def run_jobs(
     job_table,
@@
     modelopt_src_path=None,
     base_dir=None,
+    nemorun_home=None,
 ):
@@
-    metadata_path = os.path.join("experiments", experiment_title, exp._id, "metadata.json")
+    root = nemorun_home or os.environ.get("NEMORUN_HOME", os.getcwd())
+    metadata_path = os.path.join(root, "experiments", experiment_title, exp._id, "metadata.json")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 482 - 485, The code writes metadata.json to a CWD-relative path using metadata_path built from "experiments", experiment_title and exp._id; change it to write under the configured NeMo Run home instead—build metadata_path by joining the launcher’s configured NeMo Run home variable (e.g., NEMORUN_HOME or the instance/home attribute used by the launcher) with "experiments", experiment_title and exp._id, ensure os.makedirs uses os.path.dirname(metadata_path) and then json.dump metadata to that path so files are created under the configured NeMo Run home rather than the current working directory.
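The resolution order the suggestion implies — explicit argument, then `NEMORUN_HOME`, then CWD — can be sketched in isolation (the function name and arguments are illustrative, not the launcher's actual `run_jobs` signature):

```python
import os

def resolve_metadata_path(experiment_title, exp_id, nemorun_home=None):
    """Anchor metadata.json under the configured NeMo Run home, not the CWD."""
    # Explicit argument wins; otherwise fall back to the env var, then the CWD.
    root = nemorun_home or os.environ.get("NEMORUN_HOME", os.getcwd())
    return os.path.join(root, "experiments", experiment_title, exp_id, "metadata.json")

path = resolve_metadata_path("qwen3_ptq", "run_001", nemorun_home="/jobs/home")
assert path == "/jobs/home/experiments/qwen3_ptq/run_001/metadata.json"
```

With this shape, a launcher that already knows its NeMo Run home passes it through, and the file lands in the same tree regardless of where the process was started.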
436-450: ⚠️ Potential issue | 🟠 Major — Avoid relying on the `run.Experiment._id` private API.

Using `_id` couples this code to an internal nemo_run attribute and risks breakage on dependency updates. In the currently used nemo_run version, what is the supported public API (if any) for retrieving an Experiment identifier after creation, and is `_id` considered internal-only?

Also applies to: 477-482
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 436 - 450, Replace uses of the private attribute run.Experiment._id with the official public Experiment identifier accessor from the nemo_run API (for example exp.id or exp.get_id(), whichever the library exposes) everywhere it's passed (e.g., to build_slurm_executor and the local executor call sites referenced around build_slurm_executor and the earlier local executor block). Locate all occurrences of exp._id (including the other block noted) and switch them to the public accessor; if the library requires calling a method at creation time, capture and reuse that returned id instead of reading a private field. Ensure the code compiles and tests that rely on the experiment id are updated to use the new accessor.
321-332: ⚠️ Potential issue | 🟠 Major — Remove Bandit bypass comments from subprocess calls.

`# nosec` on Line 321, Line 324, and Line 331 violates repo policy even when the subprocess usage is otherwise safe.

Suggested change:

```diff
-    import subprocess  # nosec B404
+    import subprocess
@@
-    commit = subprocess.run(  # nosec B603 B607
+    commit = subprocess.run(
@@
-    branch = subprocess.run(  # nosec B603 B607
+    branch = subprocess.run(
```

As per coding guidelines, "Disallow use of '# nosec' comments to bypass Bandit security checks."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 321 - 332, Remove the "# nosec" comments on the subprocess import and the subprocess.run calls (the lines that set commit and branch), and instead make the calls explicitly safe: keep the argument lists (["git", "rev-parse", ...]) to avoid shell=True, add check=True to both subprocess.run calls and handle subprocess.CalledProcessError (or subprocess.SubprocessError) to log/raise a clear error on failure, and optionally validate git exists with shutil.which("git") before invoking; update the code around variables commit and branch to use this checked/run-and-handle pattern so Bandit suppression comments are no longer needed.
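The checked, suppression-free pattern the prompt describes might look like the following sketch (a hypothetical helper, not the launcher's actual code): list-form argv avoids `shell=True`, `check=True` turns failures into exceptions, and `shutil.which` gives a clear error when git is absent.

```python
import shutil
import subprocess

def git_rev(args, cwd=None):
    """Run `git rev-parse ...` with explicit error handling instead of `# nosec`."""
    if shutil.which("git") is None:
        raise RuntimeError("git executable not found on PATH")
    try:
        # List-form argv (no shell=True) plus check=True: safe and fail-fast.
        result = subprocess.run(
            ["git", "rev-parse", *args],
            capture_output=True,
            text=True,
            check=True,
            cwd=cwd,
        )
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"git rev-parse failed: {exc.stderr.strip()}") from exc
    return result.stdout.strip()

# Usage inside a git checkout (commented out since it depends on the environment):
# commit = git_rev(["HEAD"])
# branch = git_rev(["--abbrev-ref", "HEAD"])
```

Because nothing is interpolated into a shell string and errors surface as exceptions, Bandit has nothing to flag and no suppression comment is needed.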
🧹 Nitpick comments (1)
launcher/core.py (1)
133-135: Fail fast with a clear error for unknown/missing YAML factory names.

A raw `KeyError` from `factory_lookup[function_name]` is hard to diagnose for config authors.

Suggested change:

```diff
-    function_name = config_from_yaml["slurm_config"].pop("_factory_")
-    slurm_config = factory_lookup[function_name](**config_from_yaml["slurm_config"])
+    function_name = config_from_yaml["slurm_config"].pop("_factory_", None)
+    if not function_name:
+        raise ValueError(f"Missing '_factory_' in slurm_config for YAML: {yaml_file}")
+    try:
+        factory = factory_lookup[function_name]
+    except KeyError as exc:
+        raise ValueError(f"Unsupported slurm_config factory '{function_name}' in {yaml_file}") from exc
+    slurm_config = factory(**config_from_yaml["slurm_config"])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 133 - 135, The code currently assumes config_from_yaml["slurm_config"]["_factory_"] exists and that factory_lookup[function_name] will succeed, which yields an opaque KeyError for config authors; update the block around function_name, factory_lookup, and slurm_config to explicitly check for the "_factory_" key on config_from_yaml["slurm_config"] and that function_name is present in factory_lookup, and if not raise a clear ValueError (or custom exception) that states the missing or unknown factory name and lists available keys from factory_lookup so the user can correct their YAML; keep subsequent behavior of calling factory_lookup[function_name](**config_from_yaml["slurm_config"]) and preserving args = config_from_yaml.get("args", None).
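The fail-fast lookup can be written as a small helper; this is a sketch under stated assumptions (the registry contents below are made up — in the real launcher, `factory_lookup` is populated by `register_factory`):

```python
def resolve_factory(slurm_config, factory_lookup, yaml_file):
    """Pop `_factory_` and resolve it, raising readable errors for config authors."""
    function_name = slurm_config.pop("_factory_", None)
    if not function_name:
        raise ValueError(f"Missing '_factory_' in slurm_config for YAML: {yaml_file}")
    try:
        return factory_lookup[function_name]
    except KeyError as exc:
        available = ", ".join(sorted(factory_lookup))
        raise ValueError(
            f"Unsupported slurm_config factory '{function_name}' in {yaml_file}; "
            f"available: {available}"
        ) from exc

registry = {"slurm_factory": lambda **kw: kw}  # stand-in for the real registry
factory = resolve_factory({"_factory_": "slurm_factory"}, registry, "job.yaml")
assert factory(account="demo") == {"account": "demo"}
```

Listing the available factory names in the error message means a typo in the YAML points the author straight at the fix instead of at a bare `KeyError` traceback.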
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@launcher/core.py`:
- Around line 432-455: The current merges call
task_env.update(default_local_env) / task_env.update(default_slurm_env) which
lets defaults overwrite task-specific YAML values; instead, apply defaults first
then layer task-level env on top so task YAML wins—e.g., obtain the appropriate
default dict for the branch (for docker: default_local_env; for slurm:
default_slurm_env), copy/merge defaults first and then update with task_env (or
update a copy of defaults with task_env) before passing to build_docker_executor
/ build_slurm_executor; target the hf_local branch and the calls around
build_docker_executor, build_slurm_executor and the task_env.update(...) lines.
- Around line 482-485: The code writes metadata.json to a CWD-relative path
using metadata_path built from "experiments", experiment_title and exp._id;
change it to write under the configured NeMo Run home instead—build
metadata_path by joining the launcher’s configured NeMo Run home variable (e.g.,
NEMORUN_HOME or the instance/home attribute used by the launcher) with
"experiments", experiment_title and exp._id, ensure os.makedirs uses
os.path.dirname(metadata_path) and then json.dump metadata to that path so files
are created under the configured NeMo Run home rather than the current working
directory.
- Around line 436-450: Replace uses of the private attribute run.Experiment._id
with the official public Experiment identifier accessor from the nemo_run API
(for example exp.id or exp.get_id(), whichever the library exposes) everywhere
it's passed (e.g., to build_slurm_executor and the local executor call sites
referenced around build_slurm_executor and the earlier local executor block).
Locate all occurrences of exp._id (including the other block noted) and switch
them to the public accessor; if the library requires calling a method at
creation time, capture and reuse that returned id instead of reading a private
field. Ensure the code compiles and tests that rely on the experiment id are
updated to use the new accessor.
- Around line 321-332: Remove the "# nosec" comments on the subprocess import
and the subprocess.run calls (the lines that set commit and branch), and instead
make the calls explicitly safe: keep the argument lists (["git", "rev-parse",
...]) to avoid shell=True, add check=True to both subprocess.run calls and
handle subprocess.CalledProcessError (or subprocess.SubprocessError) to
log/raise a clear error on failure, and optionally validate git exists with
shutil.which("git") before invoking; update the code around variables commit and
branch to use this checked/run-and-handle pattern so Bandit suppression comments
are no longer needed.
---
Nitpick comments:
In `@launcher/core.py`:
- Around line 133-135: The code currently assumes
config_from_yaml["slurm_config"]["_factory_"] exists and that
factory_lookup[function_name] will succeed, which yields an opaque KeyError for
config authors; update the block around function_name, factory_lookup, and
slurm_config to explicitly check for the "_factory_" key on
config_from_yaml["slurm_config"] and that function_name is present in
factory_lookup, and if not raise a clear ValueError (or custom exception) that
states the missing or unknown factory name and lists available keys from
factory_lookup so the user can correct their YAML; keep subsequent behavior of
calling factory_lookup[function_name](**config_from_yaml["slurm_config"]) and
preserving args = config_from_yaml.get("args", None).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ce5b1f59-c0be-4a83-9853-3cff952f9271
📒 Files selected for processing (2)
- launcher/core.py
- launcher/launch.py
launch.py now only accepts pipeline=@ or --yaml. Update README with --yaml vs pipeline=@ docs, useful flags, and bug reporting. Update Qwen3-8B config to new --yaml format with job_name + pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
♻️ Duplicate comments (2)
launcher/launch.py (2)
50-50: ⚠️ Potential issue | 🟠 Major — Default `HF_HOME` still appears inconsistent with `/hf-local` mounts.

`get_default_env(EXPERIMENT_TITLE)` yields `HF_HOME=/<title>/hf-cache`, while the Slurm factory mount target is `/hf-local`. That means cache reuse can be missed unless every task overrides `HF_HOME`.

🔧 Proposed fix:

```diff
 DEFAULT_SLURM_ENV, DEFAULT_LOCAL_ENV = get_default_env(EXPERIMENT_TITLE)
+DEFAULT_SLURM_ENV["HF_HOME"] = "/hf-local"
+DEFAULT_LOCAL_ENV["HF_HOME"] = "/hf-local"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/launch.py` at line 50, The default HF_HOME produced by get_default_env(EXPERIMENT_TITLE) is set to /<title>/hf-cache which does not match the Slurm factory mount target /hf-local, causing cache reuse to be missed; update get_default_env (or the DEFAULT_SLURM_ENV/DEFAULT_LOCAL_ENV initialization) to set HF_HOME to the mounted path /hf-local (or derive HF_HOME from the Slurm mount constant) so that DEFAULT_SLURM_ENV and DEFAULT_LOCAL_ENV use the same HF_HOME as the Slurm mounts and tasks inherit the consistent cache location.
52-62: ⚠️ Potential issue | 🟠 Major — `services/*` packaging may resolve from the wrong base directory.

`include_pattern` includes `services/*`, but `relative_path` is fully anchored to `LAUNCHER_DIR`. If `services/` is at repo root, this silently excludes it from packaging.

🔧 Proposed fix:

```diff
 packager = run.PatternPackager(
@@
-    relative_path=[LAUNCHER_DIR] * 6,
+    relative_path=[LAUNCHER_DIR] * 5 + [MODELOPT_ROOT],
 )
```

Verification script:

```bash
#!/bin/bash
set -euo pipefail
echo "== services directories in repository =="
fd -HI -td '^services$' .
echo
echo "== Does launcher/services exist? =="
if [ -d launcher/services ]; then
  echo "launcher/services exists"
else
  echo "launcher/services missing"
fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/launch.py` around lines 52 - 62, The packaging include_pattern lists "services/*" while relative_path is set to [LAUNCHER_DIR] * 6, which will exclude repo-root services if they live outside LAUNCHER_DIR; update the run.PatternPackager invocation so the relative_path entry that pairs with the "services/*" pattern points to the correct base (e.g. repo root) instead of LAUNCHER_DIR — locate the run.PatternPackager call and change the corresponding element in relative_path (or compute relative_path dynamically) so the "services/*" pattern resolves from the actual services directory.
🧹 Nitpick comments (2)
launcher/launch.py (2)
74-76: Avoid unconditionally overriding caller-provided `job_dir` in local mode.

Line 87-Line 89 always rewrite `job_dir` when `hf_local` is set, which drops explicit user input.

🔧 Proposed refactor:

```diff
 def launch(
     job_name: str = "01_job",
-    job_dir: str = os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")),
+    job_dir: str | None = None,
@@
 ) -> None:
@@
-    if hf_local is not None:
-        job_dir = os.path.join(os.getcwd(), "local_experiments")
+    if hf_local is not None:
+        job_dir = job_dir or os.path.join(os.getcwd(), "local_experiments")
+    else:
+        job_dir = job_dir or os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments"))
```

Also applies to: 87-89
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/launch.py` around lines 74 - 76, The current behavior unconditionally replaces the caller-provided job_dir when hf_local is true; change this so job_dir is only auto-set when the caller didn't provide one: make job_dir default to None in the function signature (job_dir: Optional[str] = None) and initialize it from os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")) only if job_dir is None, and likewise in the hf_local branch update job_dir only when job_dir is None or empty; update references to the job_dir variable accordingly (look for the job_dir parameter and the hf_local handling block).
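The None-default pattern behind this refactor — only fall back when the caller gave nothing — is easy to show in isolation (function and values here are illustrative, not the actual `launch` signature):

```python
import os

def resolve_job_dir(job_dir=None, hf_local=None):
    """Respect an explicit job_dir; otherwise pick a mode-appropriate default."""
    if hf_local is not None:
        # Local mode default, applied only when the caller passed nothing.
        return job_dir or os.path.join(os.getcwd(), "local_experiments")
    # Slurm mode default: env var, then a home-directory fallback.
    return job_dir or os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments"))

# Explicit input always wins, in either mode.
assert resolve_job_dir("/my/dir", hf_local="yes") == "/my/dir"
assert resolve_job_dir("/my/dir") == "/my/dir"
```

One caveat with `or`: an empty string also triggers the fallback; if `""` should be treated as an explicit value, test `job_dir is None` instead.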
36-37: Consider using absolute imports for better package reusability.

Lines 36-37 use relative imports (`core`, `slurm_config`), which work for the documented script usage (`uv run launch.py`) but are fragile if the code is ever imported as a module or if entry points are added to `pyproject.toml`. Since `launcher` is a proper Python package with package infrastructure, absolute imports with a fallback improve maintainability. The proposed try-except approach is reasonable:

🔧 Suggested approach:

```diff
-from core import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type
-from slurm_config import SlurmConfig, slurm_factory
+try:
+    from launcher.core import (
+        SandboxPipeline,
+        get_default_env,
+        register_factory,
+        run_jobs,
+        set_slurm_config_type,
+    )
+    from launcher.slurm_config import SlurmConfig, slurm_factory
+except ImportError:
+    # Script-mode fallback (e.g., running from launcher/ directly)
+    from core import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type
+    from slurm_config import SlurmConfig, slurm_factory
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/launch.py` around lines 36 - 37, Replace the fragile relative imports in launch.py with absolute-package imports and a fallback to the relative names: attempt to import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type from the top-level package (e.g., launcher.core or package.core) and SlurmConfig, slurm_factory from the package slurm_config, and if that fails in an ImportError, fall back to the current relative imports; locate the import line that currently references core and slurm_config and update it to a try/except ImportError that assigns the same symbols (SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type, SlurmConfig, slurm_factory) so callers elsewhere in the file keep working regardless of whether the module is run as a script or imported as a package.
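The try/except import fallback is a standard idiom for modules that may run either as part of a package or as a loose script. A generic, runnable sketch (the package name below is a placeholder that intentionally does not exist, so the fallback branch is taken):

```python
# Prefer the package-qualified import; fall back to a script-mode alternative.
try:
    from launcher_pkg.core_mod import helper  # hypothetical package layout
except ImportError:
    # Script mode: define (or re-import) the symbol another way.
    def helper():
        return "script-mode fallback"

print(helper())
```

The key property is that downstream code only ever sees the name `helper` and never needs to know which branch supplied it.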
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@launcher/launch.py`:
- Line 50: The default HF_HOME produced by get_default_env(EXPERIMENT_TITLE) is
set to /<title>/hf-cache which does not match the Slurm factory mount target
/hf-local, causing cache reuse to be missed; update get_default_env (or the
DEFAULT_SLURM_ENV/DEFAULT_LOCAL_ENV initialization) to set HF_HOME to the
mounted path /hf-local (or derive HF_HOME from the Slurm mount constant) so that
DEFAULT_SLURM_ENV and DEFAULT_LOCAL_ENV use the same HF_HOME as the Slurm mounts
and tasks inherit the consistent cache location.
- Around line 52-62: The packaging include_pattern lists "services/*" while
relative_path is set to [LAUNCHER_DIR] * 6, which will exclude repo-root
services if they live outside LAUNCHER_DIR; update the run.PatternPackager
invocation so the relative_path entry that pairs with the "services/*" pattern
points to the correct base (e.g. repo root) instead of LAUNCHER_DIR — locate the
run.PatternPackager call and change the corresponding element in relative_path
(or compute relative_path dynamically) so the "services/*" pattern resolves from
the actual services directory.
---
Nitpick comments:
In `@launcher/launch.py`:
- Around line 74-76: The current behavior unconditionally replaces the
caller-provided job_dir when hf_local is true; change this so job_dir is only
auto-set when the caller didn't provide one: make job_dir default to None in the
function signature (job_dir: Optional[str] = None) and initialize it from
os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")) only if
job_dir is None, and likewise in the hf_local branch update job_dir only when
job_dir is None or empty; update references to the job_dir variable accordingly
(look for the job_dir parameter and the hf_local handling block).
- Around line 36-37: Replace the fragile relative imports in launch.py with
absolute-package imports and a fallback to the relative names: attempt to import
SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type from the top-level package (e.g., launcher.core or
package.core) and SlurmConfig, slurm_factory from the package slurm_config, and
if that fails in an ImportError, fall back to the current relative imports;
locate the import line that currently references core and slurm_config and
update it to a try/except ImportError that assigns the same symbols
(SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type, SlurmConfig, slurm_factory) so callers elsewhere in the
file keep working regardless of whether the module is run as a script or
imported as a package.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 07f18aab-eab7-4f3b-812c-8122649ca802
📒 Files selected for processing (3)
- launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
- launcher/README.md
- launcher/launch.py
🚧 Files skipped from review as they are similar to previous changes (2)
- launcher/README.md
- launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
Move service scripts to common/ (query.py, query.sh, eagle3, specdec_bench, megatron-lm quantize). Add Qwen3-8B EAGLE3 offline pipeline YAML. Add ADVANCED.md with architecture docs and Claude Code workflows. Update packager to include common/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Actionable comments posted: 15
♻️ Duplicate comments (2)
launcher/common/megatron-lm/quantize/quantize.sh (1)
38-38: ⚠️ Potential issue | 🟡 Minor — Preserve CLI args correctly when exporting `MLM_EXTRA_ARGS`.

Line 38 assigns `${@}` to a scalar, which collapses/splits arguments unexpectedly. Use `"$*"` for a single string export (or a true array if downstream expects elements).

🔧 Proposed fix:

```diff
-export MLM_EXTRA_ARGS=${@}
+export MLM_EXTRA_ARGS="$*"
```

Verification script:

```bash
#!/bin/bash
set -euo pipefail
# Verify all MLM_EXTRA_ARGS assignments/usages in shell scripts.
rg -n -C2 '\bMLM_EXTRA_ARGS\b' --type=sh
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/common/megatron-lm/quantize/quantize.sh` at line 38, The export of MLM_EXTRA_ARGS collapses positional parameters because it uses ${@}; change it to preserve all CLI args as a single exported string by using quoted expansion: export MLM_EXTRA_ARGS="$*" (or, if downstream expects separate elements, convert to a true bash array with declare -a MLM_EXTRA_ARGS=( "$@" ) and adapt callers accordingly); update any callers that assume an array to handle the chosen representation.

launcher/common/service_utils.sh (1)
59-62: ⚠️ Potential issue | 🟠 Major — Avoid mutating submodule source files at runtime for versioning.

Line 61 appends to `modules/Model-Optimizer/modelopt/__init__.py` on each run, which can create duplicate `__version__` entries, race across multi-node jobs, and dirty the working tree.

Verification script:

```bash
#!/bin/bash
set -euo pipefail
# Inspect the runtime append logic and current __version__ declarations.
sed -n '56,62p' launcher/common/service_utils.sh
rg -n '^__version__\s*=' modules/Model-Optimizer/modelopt/__init__.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/common/service_utils.sh` around lines 59 - 62, The current logic in service_utils.sh unconditionally appends "__version__ = '1.0.0'" when mpi_local_rank == 0 which mutates the submodule source repeatedly; change this to be idempotent or avoid editing the submodule: in the block that checks mpi_local_rank, first test the target file for an existing __version__ declaration (grep -q '^__version__' ...) and if present use an in-place replace (sed -i -E "s/^__version__.*/__version__ = '1.0.0'/") otherwise write the single line; alternatively stop modifying the submodule entirely by exporting a MODELOPT_VERSION env var or writing the version into a generated file outside the submodule (instead of echoing into the submodule file). Ensure the check/replace runs only when mpi_local_rank == 0 to preserve the single-writer invariant.
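The idempotent check-then-replace behavior described above (the shell fix would use grep/sed) can be sketched in Python; the file path is supplied by the caller here and is a stand-in, not the real submodule path:

```python
import re

def stamp_version(path, version="1.0.0"):
    """Set __version__ exactly once: replace an existing line, else append one."""
    with open(path) as f:
        text = f.read()
    line = f"__version__ = '{version}'"
    if re.search(r"^__version__\s*=", text, flags=re.MULTILINE):
        # Idempotent: rewrite the existing declaration in place.
        text = re.sub(r"^__version__\s*=.*$", line, text, flags=re.MULTILINE)
    else:
        if text and not text.endswith("\n"):
            text += "\n"
        text += line + "\n"
    with open(path, "w") as f:
        f.write(text)
```

Running it any number of times leaves exactly one `__version__` line, which is the property the blind `echo >>` append lacks.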
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@launcher/common/eagle3/dump_offline_data.sh`:
- Around line 38-42: The command invocation in dump_offline_data.sh uses an
unquoted array expansion ${@}, which allows word splitting and globbing; update
the trtllm-llmapi-launch invocation to use the quoted expansion "$@" so argument
boundaries and special characters are preserved (locate the line invoking
trtllm-llmapi-launch in dump_offline_data.sh and replace ${@} with "$@").
In `@launcher/common/eagle3/offline_training.sh`:
- Line 19: The source path is wrong: in offline_training.sh the variable
SCRIPT_DIR points to launcher/common/eagle3/, but service_utils.sh lives one
level up; update the sourcing in offline_training.sh so it references the
correct location (e.g., resolve SCRIPT_DIR/../service_utils.sh or compute the
parent dir via dirname and then source service_utils.sh) — modify the source
line that currently uses SCRIPT_DIR to point to the parent directory where
service_utils.sh actually resides.
- Around line 39-40: Re-enable the final exit handler call so CI sees failures:
restore the commented-out call to exit_handler (call exit_handler $0) at the end
of offline_training.sh so the ERR trap's FAIL_EXIT flag (set earlier by the
trap) is evaluated and the script exits nonzero when failures occur; ensure the
existing ERR trap and FAIL_EXIT behavior from launcher/common/service_utils.sh
remain unchanged and that exit_handler is invoked with the script's $0 name.
- Around line 29-31: The script forwards user args to the downstream launcher
using an unquoted ${@}, which can break arguments with spaces; update the
invocation of launch_train.sh (the line calling bash
modules/Model-Optimizer/examples/speculative_decoding/launch_train.sh and
passing --model ${HF_MODEL_CKPT} ${@}) to pass user arguments as "$@" (and also
ensure HF_MODEL_CKPT is quoted if it can contain spaces) so all positional
parameters are preserved intact when invoking launch_train.sh.
- Line 22: The pip install call in offline_training.sh uses an unquoted
requirement specifier (huggingface-hub>=1.2.1) which can be misinterpreted by
the shell; update the pip install invocation to pass the requirement specifier
as a quoted string (e.g., wrap huggingface-hub>=1.2.1 in single or double
quotes) so the >= is preserved and pip receives the correct requirement.
In `@launcher/common/megatron-lm/quantize/quantize.sh`:
- Around line 35-36: The script defines CONVERT_EXE and EXPORT_EXE but never
runs them; update the quantize.sh flow to execute these stages (e.g., run
"$CONVERT_EXE" then "$EXPORT_EXE") at the appropriate point after prerequisites
(such as after quantization completes), check their exit codes and abort on
failure, and ensure the variables are executable/invoked with bash if needed;
reference the CONVERT_EXE and EXPORT_EXE variables and place the calls in the
main execution sequence (same area where other stage commands are invoked) so
the multi-stage pipeline actually runs.
In `@launcher/common/query.py`:
- Around line 46-52: The except block that sets early_termination to True is
currently creating a local variable, so the module-level flag never changes;
update the except handler in launcher/common/query.py to declare the
module-level variable by adding a global early_termination (or otherwise
reference the module-level symbol) before assigning early_termination = True so
the assignment mutates the shared flag used later in the termination check.
- Around line 102-106: The code mutates msg["content"] in-place which can cause
race conditions when rows are shared (e.g., dataset.map with num_proc>1);
instead create a shallow copy of the message before modifying and append that
copy to current_messages (referencing the variables msg, current_messages and
the enable_thinking branch in launcher/common/query.py) so the original dataset
row is never altered by adding " /no_think".
In `@launcher/common/specdec_bench/quick_check.sh`:
- Around line 24-27: The shell invocation in quick_check.sh uses an unquoted
array expansion `${@}` which allows word splitting and globbing; update the
command that calls `${TRTLLM_LAUNCH_SCRIPT} python3
modules/Model-Optimizer/examples/specdec_bench/run.py --model_dir
${HF_MODEL_CKPT} --tokenizer ${HF_MODEL_CKPT} ${@}` to use the quoted expansion
`"$@"` so that each original argument to the script is preserved exactly (no
splitting or globbing) when passed through.
In `@launcher/common/tensorrt-llm/query.sh`:
- Line 75: The QUERY_ARGS array initialization uses unquoted expansions which
can split words; update the QUERY_ARGS assignment to quote the variable
expansions for TASK_ID and TASK_COUNT so the array elements remain single items
(i.e., use quoted "${TASK_ID}" and "${TASK_COUNT}" when constructing QUERY_ARGS)
— modify the QUERY_ARGS=(--shard-id-begin ${TASK_ID} --shard-id-step
${TASK_COUNT}) line accordingly.
- Around line 113-115: Replace the unsafe eval invocation that builds a single
interpolated string in the cmd variable and runs eval $cmd with a proper command
array and direct execution to avoid shell injection; construct an array (e.g.,
cmd_arr) containing "python", "common/query.py", "http://localhost:8000/v1",
"$MODEL", and the elements of QUERY_ARGS (use "${QUERY_ARGS[@]}") and then run
it with "${cmd_arr[@]}", removing the eval and quoting variables appropriately
so MODEL and QUERY_ARGS entries are passed as individual arguments.
In `@launcher/common/vllm/query.sh`:
- Line 81: The array initialization QUERY_ARGS=(--shard-id-begin ${TASK_ID}
--shard-id-step ${TASK_COUNT}) should quote the variable expansions to avoid
word splitting; update the QUERY_ARGS assignment to use "--shard-id-begin"
"${TASK_ID}" and "--shard-id-step" "${TASK_COUNT}" (referencing QUERY_ARGS,
TASK_ID, TASK_COUNT) so each option and its value remain single array elements.
- Around line 121-123: The script builds a single interpolated string in cmd and
runs it with eval, which is unsafe; instead construct and invoke the command
without eval by calling python directly with separate arguments (use the script
name and the URL, $MODEL and the QUERY_ARGS array as distinct parameters) so
word-splitting and injection are avoided—replace the cmd variable + eval usage
in launcher/common/vllm/query.sh with a direct invocation of python
common/query.py passing "http://localhost:8000/v1", $MODEL, and
"${QUERY_ARGS[@]}" (or equivalent positional-argument array expansion) so no
eval is required.
In `@launcher/launch.py`:
- Around line 52-63: The include_pattern passed to run.PatternPackager (variable
packager) contains a non-existent "services/*" entry which will never match
files when resolved against LAUNCHER_DIR; remove that pattern or replace it with
the correct existing directory path (e.g., the actual services folder name if it
was renamed) so the packager includes the intended files. Locate the
run.PatternPackager call (packager variable) in launcher/launch.py and either
delete the "services/*" element from include_pattern or update it to the correct
directory name that exists in the repo, then run the packager to verify matches.
In `@launcher/Qwen/Qwen3-8B/hf_offline_eagle3.yaml`:
- Around line 76-95: task_2 writes the draft model to --output_dir
/scratchspace/eagle3 while task_3 reads --draft_model_dir /scratchspace/export;
update one of them so both point to the same directory (e.g., change task_3's
--draft_model_dir to /scratchspace/eagle3 or change task_2's --output_dir to
/scratchspace/export) so task_3 can consume the draft produced by task_2; ensure
you update the corresponding arg in task_2 (the --output_dir setting) or task_3
(the --draft_model_dir argument) in the YAML.
---
Duplicate comments:
In `@launcher/common/megatron-lm/quantize/quantize.sh`:
- Line 38: The export of MLM_EXTRA_ARGS collapses positional parameters because
it uses ${@}; change it to preserve all CLI args as a single exported string by
using quoted expansion: export MLM_EXTRA_ARGS="$*" (or, if downstream expects
separate elements, convert to a true bash array with declare -a MLM_EXTRA_ARGS=(
"$@" ) and adapt callers accordingly); update any callers that assume an array
to handle the chosen representation.
In `@launcher/common/service_utils.sh`:
- Around line 59-62: The current logic in service_utils.sh unconditionally
appends "__version__ = '1.0.0'" when mpi_local_rank == 0 which mutates the
submodule source repeatedly; change this to be idempotent or avoid editing the
submodule: in the block that checks mpi_local_rank, first test the target file
for an existing __version__ declaration (grep -q '^__version__' ...) and if
present use an in-place replace (sed -i -E "s/^__version__.*/__version__ =
'1.0.0'/") otherwise write the single line; alternatively stop modifying the
submodule entirely by exporting a MODELOPT_VERSION env var or writing the
version into a generated file outside the submodule (instead of echoing into the
submodule file). Ensure the check/replace runs only when mpi_local_rank == 0 to
preserve the single-writer invariant.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f9ecbe27-9ea5-4b76-bf41-3cbb1f0dcf1c
📒 Files selected for processing (12)
- launcher/ADVANCED.md
- launcher/Qwen/Qwen3-8B/hf_offline_eagle3.yaml
- launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
- launcher/common/eagle3/dump_offline_data.sh
- launcher/common/eagle3/offline_training.sh
- launcher/common/megatron-lm/quantize/quantize.sh
- launcher/common/query.py
- launcher/common/service_utils.sh
- launcher/common/specdec_bench/quick_check.sh
- launcher/common/tensorrt-llm/query.sh
- launcher/common/vllm/query.sh
- launcher/launch.py
🚧 Files skipped from review as they are similar to previous changes (1)
- launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
```sh
# limitations under the License.

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source ${SCRIPT_DIR}/service_utils.sh
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify service_utils.sh locations and the current source statement.
fd 'service_utils\.sh$'
sed -n '18,22p' launcher/common/eagle3/offline_training.sh

Repository: NVIDIA/Model-Optimizer
Length of output: 304
🏁 Script executed:
#!/bin/bash
# Check all service_utils.sh files and directory structure
echo "=== All service_utils.sh files ==="
fd 'service_utils\.sh$' -x echo {}
echo ""
echo "=== Directory structure under launcher/common/ ==="
ls -la launcher/common/ | head -20
echo ""
echo "=== Contents of launcher/common/eagle3/ ==="
ls -la launcher/common/eagle3/ 2>/dev/null || echo "Directory not found"
echo ""
echo "=== Verify full path of offline_training.sh ==="
find launcher -name "offline_training.sh" -type fRepository: NVIDIA/Model-Optimizer
Length of output: 1106
Fix service_utils.sh source path.

Line 19 attempts to source `${SCRIPT_DIR}/service_utils.sh`, but since the script is located at `launcher/common/eagle3/offline_training.sh`, `SCRIPT_DIR` resolves to `launcher/common/eagle3/`. The actual `service_utils.sh` file is at `launcher/common/service_utils.sh` (one level up), so the source command will fail. Change to:

```diff
-source ${SCRIPT_DIR}/service_utils.sh
+source ${SCRIPT_DIR}/../service_utils.sh
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```sh
source ${SCRIPT_DIR}/../service_utils.sh
```
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@launcher/common/eagle3/offline_training.sh` at line 19: the source path is wrong. In offline_training.sh the variable SCRIPT_DIR points to launcher/common/eagle3/, but service_utils.sh lives one level up. Update the sourcing so it references the correct location (e.g., resolve SCRIPT_DIR/../service_utils.sh, or compute the parent dir via dirname and then source service_utils.sh) — modify the source line that currently uses SCRIPT_DIR to point to the parent directory where service_utils.sh actually resides.
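As a sanity check of the parent-directory sourcing this prompt asks for, here is a self-contained sketch that recreates the layout in a temp directory (the file names mirror the finding, but everything is created fresh; nothing touches the real repo):

```shell
# Recreate the launcher/common/{service_utils.sh,eagle3/offline_training.sh}
# layout in a temp dir and verify that sourcing one level up works.
set -eu
root="$(mktemp -d)"
mkdir -p "$root/common/eagle3"
echo 'greet() { echo "hello from utils"; }' > "$root/common/service_utils.sh"
cat > "$root/common/eagle3/offline_training.sh" <<'EOF'
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
# Shared utilities live in the parent directory, not next to this script.
source "${SCRIPT_DIR}/../service_utils.sh"
greet
EOF
result="$(bash "$root/common/eagle3/offline_training.sh")"
echo "$result"
rm -rf "$root"
```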
```sh
# --model is consumed here; args before "--" go to vllm serve, args after go to query.py.
MODEL=""
SERVE_EXTRA_ARGS=()
QUERY_ARGS=(--shard-id-begin ${TASK_ID} --shard-id-step ${TASK_COUNT})
```
Quote variables in array initialization.

Unquoted variable expansions in array initialization can cause word splitting issues.

🔧 Suggested fix

```diff
-QUERY_ARGS=(--shard-id-begin ${TASK_ID} --shard-id-step ${TASK_COUNT})
+QUERY_ARGS=(--shard-id-begin "${TASK_ID}" --shard-id-step "${TASK_COUNT}")
```

📝 Committable suggestion

```sh
QUERY_ARGS=(--shard-id-begin "${TASK_ID}" --shard-id-step "${TASK_COUNT}")
```
🧰 Tools

🪛 Shellcheck (0.11.0)

[warning] 81-81: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a. (SC2206)
[warning] 81-81: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a. (SC2206)
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@launcher/common/vllm/query.sh` at line 81: the array initialization QUERY_ARGS=(--shard-id-begin ${TASK_ID} --shard-id-step ${TASK_COUNT}) should quote the variable expansions to avoid word splitting. Update the QUERY_ARGS assignment to use "--shard-id-begin" "${TASK_ID}" and "--shard-id-step" "${TASK_COUNT}" so each option and its value remain single array elements.
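The word-splitting hazard behind SC2206 is easy to demonstrate: with an adversarial (hypothetical) TASK_ID containing a space, the unquoted expansion produces an extra array element while the quoted form keeps the value whole:

```shell
# Compare unquoted vs. quoted expansion inside a bash array initialization.
set -eu
TASK_ID="0 extra"                          # hypothetical value with a space
unquoted=(--shard-id-begin ${TASK_ID})     # SC2206: value splits on whitespace
quoted=(--shard-id-begin "${TASK_ID}")     # value stays one element
u="${#unquoted[@]}"
q="${#quoted[@]}"
echo "unquoted=$u quoted=$q"
```

In the normal case TASK_ID is a plain integer and both forms behave identically, which is why the bug is easy to miss until a value with whitespace shows up.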
Add tests/unit/launcher/ with 7 test files covering core dataclasses, factory registry, global_vars, env merging, YAML formats, Docker executor mounts, Slurm executor params (mocked), and end-to-end Docker launch via subprocess. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Remove self-referential launcher/modules/Model-Optimizer submodule (flagged in PR review as creating recursive nesting). Replace with a symlink to ../.. (the Model-Optimizer root). The packager's find follows symlinks so modelopt/* and examples/* are packaged correctly. Verified: Docker launch with symlink works (quantize step finds modelopt). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
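The claim above that "the packager's find follows symlinks" holds only when find is invoked with -L (or -H for command-line arguments); the default -P mode does not descend into a linked directory. A quick illustration with throwaway paths in a temp dir:

```shell
# Plain find does not descend into a symlinked directory; find -L does.
set -eu
root="$(mktemp -d)"
mkdir -p "$root/real/pkg"
touch "$root/real/pkg/module.py"
ln -s "$root/real" "$root/link"
plain="$(find "$root/link" -name '*.py' | wc -l | tr -d ' ')"
follow="$(find -L "$root/link" -name '*.py' | wc -l | tr -d ' ')"
echo "plain=$plain follow=$follow"
rm -rf "$root"
```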
Add launcher/.gitignore, CLAUDE.md. Update README with hf_local docs, test instructions, verified results. Fix ADVANCED.md stale paths. Add hf_local to GlobalVariables. Use <<global_vars.hf_local>> in YAML. Remove stale services/* from packager. quantize.sh reads MMLU_DATASET env var. launch.py auto-creates Model-Optimizer symlink. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Skip all launcher tests with pytest.skip when nemo_run is not available (CI tox env doesn't have it). Add docstrings to __post_init__ and _resolve for 100% docstring coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Move tests from tests/unit/launcher/ to launcher/tests/ for self-containment. Add launcher job to unit_tests.yml using uv. Add pytest.ini to override root pyproject.toml addopts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Switch from git-pinned nemo-run to nemo-run>=0.8.0 from PyPI (avoids uv TOML parse error). Add py-modules=[] to prevent setuptools auto- discovery. CI installs project with `uv pip install -e . pytest`. Add ModelOpt mount mechanism docs to ADVANCED.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Add launcher/ module with launch.py that submits quantization, training, and evaluation jobs to Slurm clusters via nemo-run. Produces identical code/ layout as nmm-sandbox's slurm.py so the same YAML configs work in both. Includes Megatron-LM and Model-Optimizer as submodules.
What does this PR do?

Type of change: ?

Usage

```python
# Add a code snippet demonstrating how to use this
```

Testing

Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
Release Notes
New Features
Documentation